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Abstract — Suppose a linear model y = Hx + n, where inputs 
x, n are independent Gaussian mixtures. The problem is to design 
the transfer matrix H so as to minimize the mean square error 
(MSE) when estimating x from y. This problem has important 
applications, but faces at least three hurdles. Firstly, even for a 
fixed H, the minimum MSE (MMSE) has no analytical form. 
Secondly, the MMSE is generally not convex in H. Thirdly, 
derivatives of the MMSE w.r.t. H are hard to obtain. This paper 
casts the problem as a stochastic program and invokes gradient 
methods. 

The study is motivated by two applications in signal processing. 
One concerns the choice of error-reducing precoders; the other 
deals with selection of pilot matrices for channel estimation. In 
either setting, our numerical results indicate improved estimation 
accuracy - markedly better than those obtained by optimal design 
based on standard linear estimators. 

Some implications of the non-convexities of the MMSE are 
noteworthy, yet, to our knowledge, not well known. For example, 
there are cases in which more pilot power is detrimental for 
channel estimation. This paper explains why. 

Index Terms — Gaussian Mixtures, minimum mean square 
error (MMSE), estimation 



I. Problem statement 
Consider the following linear system 

y = Hx 



n. 



(1) 



Here y is a vector of observations, and x and n are mutually 
independent random vectors with known Gaussian Mixture 
(GM) distributions: 
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In this work, we assume that H is a transfer matrix that 
we are at liberty to design, typically under some constraints. 
Specifically, our objective is to design H such that x can be 
estimated from y with minimum mean square error (MMSE). 
The MMSE, for a fixed H, is by definition fl] 



MMSE = E{\\x- u x | y ||^} 

l x - u x | y ||^/(x,y)dxdy. 



(4) 
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Here, ||-|| 2 denotes the 2-norm, /(x,y) is the joint probability 
density function (PDF) of (x, y), 

u x , y ^£{x|y} = J x/(x|y)dx (5) 

is the MMSE estimator, and /(x|y) is the PDF of x given 
y. The MMSE in equation (@]l depends on H both through 
u x | y and /(x, y). Our objective is to solve the following 
optimization problem 
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(6) 



where H denotes a prescribed set of matrices that H must 
belong to. Solving this optimization problem is not straightfor- 
ward. In particular, three hurdles stand out. Firstly, with (O and 
(01 as inputs to (Q}, the MMSE in (0]i has no analytical closed 
form [2 1. Thus, the effect of any matrix H, in terms of MMSE, 
cannot be evaluated exactly. Secondly, the MMSE is not 
convex in H. Thirdly, the first and second order derivatives of 
the MMSE w.r.t H cannot be calculated exactly, and accurate 
approximations are hard to obtain. For these reasons, and in 
order to make progress, we cast the problem as a stochastic 
program and invoke the Robbins-Monro algorithm Q, fl4]. 
Very briefly our approach goes as follows: We draw samples 
from x and n and use these to compute stochastic gradients 
of the MMSE. These feed into an iterative gradient method 
that involves projection. 

The contributions of the paper are several: 

• As always, for greater accuracy, its preferable to use 
gradients instead of finite difference approximations. For 
this reason the paper spells out a formula for exact real- 
ization of stochastic gradients. Accordingly, the Robbins- 
Monro algorithm comes to replace the Kiefer-Wolfowitz 
procedure. 

• In the design phase, we exploit the known input statistics 
and update H based on samples of the inputs (x, n), 
instead of output y. This yields a closed form stochastic 
gradient, and we prove that it is unbiased. 

> Numerical experiments indicate that our method has far 
better accuracy than methods which proceed via linear 
estimators. The main reason is that the optimal estimator, 
used here, is non-linear. 

• It turns out that the non-convexities of the MMSE may 
have practical implications that deserve being better 
known. Specifically, in channel estimation, it can be 
harmful to increase the power of the pilot signal. This 
paper offers an explanation. 

Clearly, in many practical problems, the quantities in (Q~|i are 
complex-valued. Throughout this paper, however, they will all 
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be assumed real. For the analysis, this assumption introduces 
no loss of generality, as the real and imaginary parts of (fl3 
can always be treated separately. 

The paper is organized as follows. The next section outlines 
two applications. It also specifies the Gaussian mixtures and 
motivates their use. Section [ill] illustrates the problem by 
means of a simple example. Section [IV] spells out problem 
(O in full detail. Section |V] reviews how the Robbins-Monro 
method applies. Numerical results are provided in Section [VTl 
Section IVHI concludes. A large part of the detailed analysis, 
concerning stochastic gradients, is deferred to the appendix. 

II. Background and Motivation 

The above described matrix design problem appears in 
various applications of interest. Next we present two of 
these, which are of particular interest to the signal processing 
community. Then we will explain and motivate the GM input 
statistics. 

A. Linear precoder design 
Consider a linear system model 

y=BFx + n, (7) 

H 

where B is a known matrix, and F is a precoder matrix to 
be designed such that the mean square error (MSE) when 
estimating x from y becomes as small as possible. The vector 
n is random noise. If x and n are independent and GM 
distributed, this is a matrix design problem as described in 
SectionH] where F is the design parameter. A typical constraint 
is to require that Fx cannot exceed a certain average power, 
i.e. E 1 1 Fx 1 1 2 < 7- Together with the nature of B, this 
determines EI in (O. 

A linear model with known transfer matrix H and GM dis- 
tributed inputs is frequently assumed within speech and image 
processing. In these applications, the signal of interest often 
exhibits multi modal behavior. That feature can be reasonably 
explained by assuming an underlying GM distribution. The 
noise is often modeled as Gaussian, which is a special case 
of a GM. Conveniently, with © and (O as inputs, the MMSE 
estimator in (|5]l has a closed analytical form for any given H. 
Selected works exploiting this include lO-lUl. However, none 
of these works study MSE reducing precoders. These have the 
potential to significantly improve the estimation accuracy, and 
should therefore be of interest. 

B. Pilot signal design 

Consider a multiple-input-multiple-output (MIMO) commu- 
nication model 

z = As + n, (8) 

where A is a random channel matrix that we wish to estimate 
with as small MSE as possible, and s is a pilot signal 
to be designed for that purpose. As before, n is random 
noise. In order to estimate A with some confidence, we must 
transmit as least as many pilot vectors as there are columns 
in A. In addition we must assume that the realization of 



A does not change before all pilots have been transmitted. 
This assumption typically holds in flat, block-fading MIMO 
systems [10|-[12|. With multiple transmitted pilots, model ([8]) 
can be written in matrix form as 

Z = AS + N. (9) 

If A is to x n, then this model can be vectorized into (Thm. 
2, Ch. 2, El) 

vec(Z) = (S T ® I m ) vec(A) + vec(N) . (10) 

y H x n 

Here I m denotes the mxm identity matrix, the vec(-) operator 
stacks the columns of a matrix into a column vector, and ® 
denotes the Kronecker product. Assuming that the channel 
(x) and noise (n) are independent and GM distributed, this 
is again a design problem as described in Section U where the 
pilot matrix S is the design parameter. A natural constraint, 
is to impose power limitations on the transmitted pilots, 
i.e. ||S|| 2 < 7. Together with the structure imposed by the 
Kronecker product, this then determines H in ©■ 

In ( TTOT ). one may either assume that n is pure background 
noise, or that n represents noise and interference. In the former 
case a Gaussian distribution may be justifiable, whereas in 
the latter a GM distribution may be more suitable [14|, |15|. 
As for the channel x, a GM distribution can account for 
multiple fading situations. This can be useful, for example 
if the source is assumed to transmit from multiple locations. 
Then, the commonly used Rice distribution is unlikely to 
accurately capture the channel statistics associated with all 
transmit locations (especially so in urban areas). In fact, in |[T6l 
it has been experimentally observed and reported that different 
transmit locations are indeed associated with different channel 
statistics. A GM distributed channel, with multiple modes, has 
the potential to capture this. 

The assumption that a channel realization can originate from 
several underlying distributions is not novel. For instance, 
all studies assuming channels governed by a Markov Model 
make this assumption, see e.g. flTl . [18| and the references 
therein. A GM is a special case of an Hidden Markov model, 
where subsequent observations are independent, rather than 
governed by a Markov process. In spite of this, to the best 
of our knowledge, pilot optimization for estimating channels 
governed by a GM distribution has not been considered in the 
literature. 

C. Gaussian Mixture distributions 

While aimed at minimizing the MSE, most optimization 
studies on linear precoders [19|, [20] or pilot signals [10|- 
|[T2l utilize only the first and second moments of the input 
distributions. Commonly, the underlying motivation is that a 
linear MMSE (LMMSE) estimator is employed. The LMMSE 
estimatoiQ only relies on first and second order statistics, 
which conveniently tends to simplify the associated matrix 
design problem. In fact, the desired matrix can often be 

'Among all estimators which are linear (affine) in the observations, the 
LMMSE estimator obtains the smallest MSE. 
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obtained as the solution of a convex optimization problem. 
It is known, however, that the LMMSE estimator is optimal 
only for the special case when the random signals x and n 
are both Gaussian. For all other cases, the LMMSE estimator 
is suboptimal. 

In practice, purely Gaussian inputs are rare. In general, the 
input distributions may be asymmetric, heavy tailed and/or 
multi modal. A type of distribution that can accommodate 
all of these cases is the Gaussian Mixture (GM) distribution. 
In fact, a GM can in theory represent any distribution with 
arbitrary accuracy lOTl . (22). Therefore, in this work, we 
assume that the inputs are GM distributed as in (O and (0. 
Notation (O should be read in the distributional sense, where 
x results from a composite experiment. First, source k £ K, 
is activated with probability pk > 0, YlkelcPk ~ ^ Second, 
that source generates a Gaussian signal with distribution law 
A/"(u x fc \ Cxx)- For any realized x, however, the underlying 
index k is not observable. The noise n emerges in an entirely 
similar, but independent manner. JC and C are index sets. In 
theory, it suffices that these sets are countable, but in practice 
they must be finite. Clearly when K, and C are singletons, one 
falls back to the familiar case of Gaussian inputs. 

The mixture parameters, e.g. (j>k, Ux , cS)s;e/c. are rarely 
given a priori. Most often they must be estimated, which is 
generally a non-trivial task ETl . [23 1. A common approach is 
to estimate the GM parameters from training data. The expec- 
tation maximization (EM) algorithm is well suited, and much 
used, for that purpose [lj, [24|, [25|. Briefly, the algorithm 
relies on observations drawn from the distribution we wish 
to parametrize, and some initial estimate of the parameters. 
The observations are used to update the parameters, iteratively, 
until convergence to a local maximum of the likelihood 
function. Because the resulting GM parameters depend on the 
initial estimates, the algorithm can alternatively be started from 
multiple initial estimates. This produces multiple sets of GM 
parameters, and each set can be assigned probabilities based 
on the training data. Our starting point is that the distributions 
in © and © have resulted from such, or similar model fitting. 

Model (Q~|i with GM inputs (O and (01 is quite generic, 
and we have indicated two signal processing applications 
where the matrix design problem appears. The solution to that 
problem is, however, essentially application independent. It 
should therefore be of interest to a wide audience. To the best 
of our knowledge, it has not been pursued in the literature. 

III. An illustrative example 

We start by studying a special instance of the matrix design 
problem, where the MMSE for all H S H can be plotted. In 
general this is not possible, but the following simple example 
reveals some fundamental properties of the problem. Assume 
that we wish to design a precoder, as in (£7), where B = 1 2 and 
F is restricted to be an orthogonal matrix. Equation (Q then 
simplifies to y = Fx + n, where F only rotates x. Further, 
let x and n be independent and identically GM distributed as 

-Af {ae x , I 2 ) + -N (-ae x , I 2 ) , 



where a is a scalar and is the unit vector along the x- 
axis. Assume initially that F = I2, which corresponds to no 
rotation. In this case, Figure Ufa) illustrates the densities of Fx 
(full circles) and n (dashed circles), when seen from above. 
They are identical and sit on top of each other. Now, with a 




Fig. 1. (a): Densities without any rotation, (b): The effect of rotating x by 
tt/2. 

precoder that rotates x by tt/2, the densities of Fx and n will 
look like in Figure |TJb). The latter configuration is preferable 
from an estimation viewpoint. This is clear from figure [2] 
where the MMSE is displayed as a function of all rotation 
angles between and 2tt (with a = 2). As can be seen, a 
significant gain can be obtained by rotating tt/2 (or by 3ir/2). 
This gain is not due to a particularly favorable signal-to-noise- 
ratio SNR = E ||Fx||2 /E 1 1 n 1 1 ^ ; because F is orthogonal, the 
SNR remains equal for all rotation angles. The MMSE gain 
is instead due to a rotation producing a signal which tends to 
be orthogonal to the noise. 

The above example is a special case of the matrix design 
problem, where H in (Q]i is restricted to be orthogonal. It is 
clear that H plays a decisive role in how accurately x can be 
estimated. An almost equally important observation, however, 
is that the MMSE is not convex in H. Hence, in general, we 
cannot expect that first order optimality (zero gradient) implies 
a global minimum. When studying the channel estimation 
problem further, we will see an implication of this non- 
convexity, which is perhaps not well known: In certain cases 
the MMSE of the channel estimate does not decrease with 
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1.5 • 1 • 1 1 1 1 

1 2 3 4 5 6 7 

Rotation angle [radians] 

Fig. 2. MMSE versus rotation angle. 
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increasing pilot power. On the contrary, the MMSE may in 
fact increase. 

In the next section we rewrite the original minimization 
problem into an equivalent but more compact maximization 
problem. Then, in Section, [V] we present a stochastic opti- 
mization approach which provides a solution. 

IV. An equivalent maximization problem 

In order to propose a solution to the matrix design problem 
in ©, we first rewrite expression Using the results of [2|, 
it follows that for model ((TJ, under independent GM inputs 
© and (01, and a fixed H, the MMSE can be written as 



£{||x-£{x|y}||I} = 

Eft ( tr ( c ~ 



,(*0 



ix| y || 2 /(y)dy- 
(ii) 



In (fTTT i. tr(-) denotes the trace operator and f(y) is a (GM) 
probability density function 



/(y) = X)pfc«i/ (fc,0 (y) > 



k.I 



where 



■ (fcJ, ) r c^ M) (y- 



/wj( y ) = r 

1V1 

c (fc,0 



^yy 



yy 



hcWh t + c1'1. 



(12) 



(13) 

(14) 
(15) 



In (fT3l . ( ) T denotes transposition, | | denotes the determinant, 
Cyy^' is short for (C^ )" 1 and M is the length of y. The 



MMSE estimator u x | y in (fTTl i can be written as 



/(y) 



(16) 



where 



u 



(k,i) 
x|y 



u 



(AO 



cS^C^fy-uW)). (17) 



In what follows, it is convenient to define 
G(H,y) = ||u x | y || 2 . 



(18) 



This notation emphasizes that the squared norm of the MMSE 
estimate depends on both H and the observation y. In ( fTTT l. 
only the integral depends on H. Exploiting this, and using 
( HE) , the minimizer of the MMSE is that H which maximizes 



J G(H, y)/(y)dy - E [G(H, y)] =: g(U), (19) 



subject to 



H e 



(20) 



The integral in ( fl9] > cannot be evaluated analytically, even 
for a fixed and known H Q. Moreover, as the example in 
SectionlHilillustrated. the MMSE is generally not convex in H, 
which implies that g(H) is generally not concave. Hence, any 
optimization method that merely aims at first order optimality, 



does in general not produce a global maximizer for <?(H). 
Finally, as argued in the appendix, neither first or second order 
derivatives of ,g(H) w.r.t. H can be computed exactly, and 
accurate approximations of these are hard to obtain. 

V. The Robbins-Monro solution 

The above observations suggest that a sampling based 
approach is the only viable option. The problem of maximizing 
a non-analytical expectation E[G(H,y)], over a parameter 
H, falls under the umbrella of stochastic optimization. In 
particular, for our problem, the Robbins-Monro algorithm Q, 
101, can be used to move iteratively from a judicially chosen 
initial matrix Ho to a local maximizer H*. The philosophy 
is to update the current matrix H using the gradient of 
the MMSE. Since the gradient cannot be calculated, one 
instead relies on a stochastic approximation. Translated to our 
problem, the idea is briefly as follows. Although (fT9l cannot 
be computed analytically, it can be estimated from independent 
sample vectors {y,; = Hx^ + n^}^, as 

ff(H)«4si 



x lyi 



(21) 



The derivative of (121) w.r.t. H represents an approximation 



of 



3g(H) 



;ypj— , which can be used to update the current H. Each 
update is then projected onto the feasible set H. This is the 
core idea of the much celebrated Robbins-Monro algorithm 
0. In our context, the algorithm can be outlined as follows. 

• Let the current matrix be H r . 

> Draw at random (x, n) and compute y = H r x + n. 

• Calculate the update direction as 

5G(H r ,y) 



B, 



and 



W r = H r + e r B r , 



(22) 



(23) 



where {e r }^Li is an infinite sequence of step sizes 
satisfying e r > 0, e r — > and YlrLi e r = °°- 
Update the matrix as 



H r+ 1 = 7T H (W r ) , 



(24) 



where 7Th(-) represents the projection onto the set of 
permissible matrices H. 
• Repeat all steps until convergence. 

A. Remarks on the Robbins-Monro algorithm 

Recall that the input statistics (0, (0 are assumed known. 
Therefore, in a design phase, it is reasonable to assume that 
the inputs (x, n) can be sampled to compute y, as indicated 
in the second step of the algorithm. The alternative would 
be to sample y directly, leaving the underlying x and n 
unknown. The first approach is preferred because it guarantees 
that d22b becomes an unbiased estimate of ^f^p-, whereas the 
alternative does not. This important point is fully explained 
in the appendix. In general, the Robbins-Monro procedure 
does not require observing the input realizations (x, n). The 
algorithm converges also when only outputs y are available. 
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For this reason we write (122b in terms of y, but for our 
implementation we will assume that the underlying inputs 
(x, n) are fully known. 

Because the gradient direction 8G ^' y ^ in (l22l is central 
in the algorithm, its closed form expression is derived in the 
appendix. Observe that dG ^ y ^ is random because it relies 
on a random realization of y. Specifically it is a stochastic 
approximation of ^pp- based on a single observation vector 
y. Instantaneously, it may even point in directions opposite 
to ■ In order to increase the likelihood of a beneficial 

update, one can alternatively compute the gradient as an 
average based on multiple y's, as suggested in ( l2Tb . Then 

_ 1 y> 9G(H r jj) 

2 — 1 

In our implementation of the algorithm, however, we do not 
do this. In fact, it was recognized by Robbins and Monro, that 
choosing N large is generally inefficient. The reason is that 
H,. is only intermediate in the calculations, and as argued in 
the appendix, regardless of the value of N, the update direction 
can be chosen such that it coincides with in expectation. 

The Robbins-Monro procedure does not rely on the exis- 
tence of aG g^' y ' ) at all points. If this derivative is discon- 
tinuous, one can instead use any of its sub-gradients; all 
of which are defined. Consequently, if the local maximum 
towards which the algorithm converges has a discontinuous 
derivative, then the algorithm will oscillate around this point. 
Due to the decaying step sizes, however, the oscillations will 
eventually become infinitesimal, and for all practical purposes, 
the system comes to rest. 

Convergence towards a local optimum is guaranteed only as 
r — > oo 0, |4). Therefore, in theory, the algorithm must run 
forever in order to converge. The engineering solution, which 
tends to work well in practice, is to terminate the algorithm 
when j|H,+i — H r || 2 < 7, where 7 is a chosen threshold, 
or simply after a predefined number of iterations. Still, the 
associated running time may be non-negligible, and therefore 
the Robins-Monro procedure is best suited when the input 
signals are stationary. 

In general, for other problems than considered here, it may 
happen that the functional form of G(H, y) is unknown, even 
when its output can be observed for any H and y. In this 
case, 9 qh cannot be computed. Instead one may replace 
it by a finite difference approximation 1261 . In some cases, 
this may also be preferable even when the derivative can 
be computed; Especially so if computing 9G ^ y ^ requires 
much effort. When finite difference approximation are used, 
the procedure is known as the Kiefer-Wolfowitz algorithm 
0], J26]. If the derivative can be computed, however, the 
Kiefer-Wolfowitz algorithm is associated with more uncer- 
tainty (larger variance) than the Robbins-Monro procedure. 
For the interested reader, the present paper extends 12711 . which 
considers Kiefer-Wolfowitz precoding. 

VI. Numerical results 

In this section we will study two specific examples. One is 
on linear precoding, the other is on pilot design for channel 



estimation. In conformance with much of the literature, we will 
use the normalized MSE (NMSE) as performance measurfl 
This is defined as 



NMSE 



£{||x-u x | y ||j} 
£{||x|||} • 



A. Precoder design 

Here we study the performance of a Robbins-Monro pre- 
coder. As in the simple example of Section [Till we restrict the 
precoder to be orthogonal. Thus, the norm of precoded signal 
is equal to that of the non-precoded signal. For the current 
example we choose the following parameters. 

. B = I 2 . 

• x is GM distributed with parameters 
Vk = 1/4, for k = 1...4 

,(2) = 



u« = 

X 
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u 
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The noise is Gaussian and distributed as 





' " 


,a 


' 1 







( 








0.1 


) 



(25) 



where a is a scalar that can account for any chosen 
SNR=tr(C xx ) /tr(C nn ), and C xx and C nn are the co- 
variance matrices of x and n respectively. 
« We use Fo = 1 2 as the initial guess in the Robbins-Monro 
algorithm. 

• As stopping criterion we use: ||F. r+ i — F r || 2 < 10~ 4 . 

Because we have assumed B = I2, we may in the Robbins- 
Monro algorithm of Section [V] simply replace all H r with 
the precoder F r . For the projection in d24l . we choose the 
nearest orthogonal matrix. This projection is the solution to 
the following optimization problem. 

,2 



• r+1 



argmin ||F — W r n 9 
FeO 



where O is the set of orthogonal matrices. The solution is 
particularly simple, and exploits the singular value decompo- 
sition: 

W. r = UDV T => F r+1 = UV T . 

Figure [3] displays the NMSE with and without precoding, 
for increasing SNR levels. As can be seen, Robbins-Monro 
precoding provides a significant NMSE gain, especially at 
SNR levels between and lOdB. Observe that the common 
approach of using the LMMSE estimator (and its correspond- 
ing precoder) is highly suboptimal at intermediate SNR levels. 
In fact, it is much worse than doing MMSE estimation without 
any precoding. 

2 Assuming x to be a zero mean signal, the NMSE is never larger than 1 
(zero dB). The reason is that the MMSE estimator, u x | y , will coincide with 
the prior mean of x only when the SNR tends to zero. Hence, the prior mean 
is a worst case estimate of x, and the NMSE describes the relative gain over 
the worst case estimate. 




SNR [dB] 

Fig. 3. The NMSE with and without preceding. 



The above example indicates that our method generates a 
reasonable precoder, for particular GM input distributions. 
Admittedly, there exists input statistics for which the gain is 
much less significant. However, in all simulations we have car- 
ried out, the clear tendency is that a Robbins-Monro precoder/ 
MMSE receiver outperforms the LMMSE precoder/LMMSE 
receiver at intermediate SNR levels. 

B. Pilot design for channel estimation 

Also for the channel estimation problem, the Robbins- 
Monro pilot matrix/ MMSE receiver outperforms the LMMSE 
pilot matrix/LMMSE receiver at intermediate SNR levels. In 
the next example we choose parameters in order to highlight 
this, and one additional property. That property is a direct 
consequence of the non-convex nature of the MMSE. We 
believe it to be of interest, but not well known. The starting 
point is the channel estimation problem in ©, where we 
assume that all matrices are 2 x 2. In the corresponding 
vectorized model 

vec(Z) = (S T <g> I 2 ) vec(A) + vec(N), (26) 

y II x n 

we assume the following parameters. 

> The vectorized channel, x, is distributed as A/"(0, 14). 

> The vectorized noise, n, is GM distributed with parame- 
ters 

qi = 1/2, for I = 1,2, 
uW=-u( 2 )=5[l 1 1 1] T , 

= C<£ = il 4 (27) 

• As constraint we impose that ||SL = a, where a is 
a positive scalar that can account for any chosen pilot 
power, and therefore also any SNR=||S|| 2 /tr(C nn ). 

• Stopping criterion: ||S r+ i — S,.|| 2 < 10~ 4 . 
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Fig. 4. The NMSE as a function of pilot power. 



As starting point for the Robbins-Monro algorithm we set 
S equal to a scaled identity matrix satisfying the power 
constraint. During the iterations we rely on the following 
simple projection: If the candidate pilot matrix has power 
1 1 S 1 1 2 = 7> then S — > y/^S. Thus, if the pilot matrix does not 
use the entire power budget, the magnitude of all its elements 
are increased. Similarly, if pilot matrix has to large power, the 
magnitude of its elements are decreased. 

Figure |4] shows the estimation accuracy for increasing SNR 
(increasing values of a). It can bee seen that our method 
outperforms the commonly used LMMSE channel estima- 
tor/LMMSE pilot matrix at intermediate SNRs. In fact, for 
the same range of SNRs, the latter is much than transmitting 
a scaled identity pilot matrix and using the MMSE estimator. 
As the SNR increases, however, it is known that the LMMSE 
estimator becomes optimal Q. The performance gap between 
our approach and the LMMSE estimator at high SNR indicates 
that a scaled identity pilot matrix is a local optimum that 
the Robbins-Monro algorithm does not easily escape from. 
The most striking observation in figure [4] however, is that the 
channel estimates may become worse by increasing the pilot 
power! This is not an artifact of the Robbins-Monro algorithm; 
the same tendency is seen when a scaled identity (satisfying 
the same power constraint) is used as the pilot matrix. 

C. Increased pilot power ^ improved channel estimates 

We believe that the above phenomenon is not well known, 
and that it deserves to be explained. In order to visualize what 
happens, we will consider an example of smaller dimensions, 
but with similar properties as in the previous subsection. 
Specifically, we will assume that the unknown channel matrix 
A is 2 x 1, and that the pilot signal is just a scalar, s = a. 
Then, using ( TTOb it follows that H = al2- Thus, in this setup, 
we do not optimize anything, we only try to explain the NMSE 
as function of different values for a. We assume the following 
parameters: 

• H = al2, where a is a scalar that we can vary. 
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Fig. 5. The NMSE as a function of the scalar a. 
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Fig. 6. Sampled observations y for a = 3.45 dB and a = 7.6 dB. 



• The signal (channel) x is distributed as jV(0, I2). 

• The noise n is GM distributed with parameters 

qi = 1/2, for/ = 1,2, 

u« = -u( 2 > = 5 [ 1 1] T , 

= C ( 2 = \l 2 (28) 

In figure [5] the NMSE is plotted as a function of increasing 
values for the scalar a. We observe the same tendency as 
in figure |4] for increasing values of a (corresponding to 
increasing pilot power in figure 0J, the NMSE may increase. 
In figure [5] we also plot the NMSE that would be obtained 
by a genie aided estimator 0. Briefly, the genie aided 
estimator knows from which underlying Gaussian source the 
noise n originates for each observation y. Accordingly it can 
always produce the MMSE estimate corresponding to a purely 
Gaussian model. The genie aided estimator can of course not 
be implemented in practice, but because it is much better 
informed than the MMSE estimator, it provides a lower bound 
on the NMSE. Yet, from figure [5] we see that for a < 3.45 
dB, the MMSE estimator is able to pin-point the correct noise 
component. A plausible explanation is the following. For 
small a, almost all realizations of Hx = al 2 x are close to 
the origin. Thus, observations y tend to appear in two distinct 
clusters; one cluster centered at each noise component. As 
a consequence, the active noise component can essentially 
always be identified. As a grows, Hx = al 2 x take values in 
an increasingly larger area, and the cluster borders approach 
each other. The value a = 3.45 dB is the largest value for a 
where the clusters can still be 'perfectly' separated. This value 
corresponds to the local minimum in figure [5] Because we are 
considering 2-dimensional random vectors, we can actually 
visualize these clusters. The upper part of figure [6] shows how 
400 independent y's form two nearby, but separable, clusters 
generated at a = 3.45 dB. When a grows beyond this level, the 
receiver faces a larger identification problem: it is harder to tell 
which noise component was active. The lower part of figure 
[6] shows 400 independent y's generated at a = 7.6 dB. This 



value corresponds to the local maximum in figure [5] Here the 
clusters largely overlap. As a continues to grow, however, the 
average magnitude of a noise contribution becomes so small 
compared to the average magnitude of alx, that near perfect 
recovery of x eventually becomes possible. 

Translated to the channel estimation problem in figure |4] 
the interpretation is that there is a continuous range where 
increasing the pilot power is harmful. From figure |4] one 
observes that, unless one can spend an additional 15 dB 
(approximately) on pilot power, one will not improve from 
the local minimum solution. 

VII. Conclusion 

We have provided a framework for solving the matrix 
design problem of the linear model under Gaussian mixture 
statistics. The study is motivated by two applications in 
signal processing. One concerns the choice of error-reducing 
precoders; the other deals with selection of pilot matrices 
for channel estimation. In either setting we use the Robbins- 
Monro procedure to arrive at a solution. Our numerical results 
indicate improved estimation accuracy at intermediate SNR 
levels; markedly better than those obtained by optimal design 
based on the LMMSE estimator. 

Although the Robbins-Monro algorithm in theory only con- 
verges asymptotically, in practice we see that a hard stopping 
criterion may work well. The algorithm is still computationally 
demanding, and therefore best suited under stationary or near 
stationary settings. 

We have explored an interesting implication of the non- 
convexity of the MMSE; namely a case where spending more 
pilot power gives worse channel estimates. This phenomenon 
is not linked to the stochastic optimization procedure. It can 
be observed without optimizing H at all, and we have offered 
a plausible explanation. 
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Appendix 

This appendix derives a closed form expression for the 
gradient direction g^ y ^ in (l22l >. where G (H, y) is defined 
through dT6b-(fl~8T>. To that end, it is worth observing that when 
optimizing H it is beneficial if the designer can draw samples 
directly from the inputs x and n, and not only the output y. 
In order to see why, assume in what follows that the order of 
derivation and integration can be interchanged such that 



dg(H) 
<9H 



d 

dH 



G(H,y)/(y)dy 
= / Jj[G(H, y )/(y)dy] 



(29) 



Now, if we can only observe outputs y, we have 



E 



dG(H,y) 
<9H 



3G(H ' y) /M^ a9(H) 



<9H 



Hence, in this case, the update direction 8G ^ y ^ is not an 
unbiased estimator of the gradient d9 g^ ■ In contrast, assume 
that we can draw inputs (x, n), and define 



then 




|u x |Hx+n|| 2 = G(Hx + n) : 

dG (Hx + n) 
dH 

G(Hx + n)/(x)dx/(n)dn 



f(x)dxf(n)dn 

= dg(H) 
<9H ' 



Here, the second equality holds because /(x)efoc/(n)<in is 
independent of H. Hence, dG (*^+n) ^ an un ^j ase( j estimator 
of 9g J£P , which is of course desirable. Because it is beneficial 
to sample x and n, rather than just y, we will assume here 
that the designer can do this. In practice, this implies that 
the optimization of H is done off line, as preparation for the 
subsequent estimation. 

In what follows, we will prove that interchanging the order 
of integration and derivation, as in d29l ) is justifiable. We will 
derive a closed form expression 

for ^g±n> ; and 

use this as 

the update direction in (|22| >. Our strategy, however, will be to 
do this in the reverse order: First we compute the derivative, 
assuming that the change can be done, and then we show that 
that differentiation under the integral sign is justified. Although 
we assume knowledge of (x, n) for each observed y, we will 
write y instead of Hx + n, and dG ^ y ^ instead of 8G (^x+") t 
simply to save space. 

Using ([T6]l, ° G gH' y ' 1 can be written as 

_ d ( f^(y)f^(y)u%fu^ 

(30) 

In order to compute ( f30b . we make use of the following 
theorem |[T3l . 



Theorem 1: For a scalar function, 0(H), of a matrix argu- 
ment, the differential has the form 

d(4>) = tr (Q T d(H)) = vec(Q) T vec(dH), 
whel -e Q = _. 

In our case, we take 0(H) to be the expression in the large 
parenthesis of ( |30l >. We will identify its differential, and exploit 
Theorem Q] in order to obtain the derivative. For that purpose, 
it is convenient to define 



f k ' l ' r ' s = f {k ' l) (y)f (r ' s) (y), 

x\y x\y ' 

t = ^2p k qif {k > l \y). 

k,l 



(31) 
(32) 
(33) 



Using these, the derivative in ( |30i >, can then be compactly 
— ' ' Using the chain rule, the dif- 



wntten as 



on 



ferential of this fraction is 
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d(4>) = d 



k,l,r,s k,l,r,i 



2 fk,l,r,s z k,l,r.s d ^ 



t 2 J t 3 

d (fk,l,r,8} z k,l,r,s + d ( z k,l,r,sjfk,l,r„ 

1* 



(34) 



Thus, we must identify the differentials d(f k ' l,r ' s ), d(z k,l ' r,s ) 
and d(t), which we do in next. Notation: we will in the 
remainder of this appendix use (•) to compactly denote the 
trace operator. 

Computing d(/ fc ' i,r ' s ) 

d(/ M ' r ' s )=rf(/^ 3 (y)/ ( ^(y)) 
= d (f^ (y)) (y) + (y)d (f^ (y)) . (35) 

The differential d (f (k ' l) {y)), is a differential of a Gaussian 
probability density function. In our case it depends on the 
indexes (k, I), but in order to enhance readability, we will 
disregard these indexes in what follows. Thus, for now, 
we will use equations ([T2]i-(fT7]> with all indexes removed, 
and reincorporate the indexes when needed. In addition, we 
will for now disregard the constant factor (2ir)~~ in ( fT3l ). 
Hence, instead of considering the differential d(/'*'^(y)), 
we therefore consider d^|C yy | 'g(y)^, where <?(y) = 
e -^(y- u y) Tc yy(y- u y). This can be written as 



d(\C yy \- 

= d (iCy, 

= _5(y) 

2 

g(y) 



■«?(y) 



1 )g(y) 



^yy 



d(g(y)) 



Cyy| 2 d (|C yy |) 



2~ |Cyyl 



^((y-u y ) T Cr y (y-u : 



(36) 



In the second equality we have used the chain rule, and 
exploited that g(y) is an exponential function. The first 
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differential in d36l ), provided C yy is full rank, is (Theorem 

i, ch. 8, of ni) 

rf(|C yy |) = |C yy |(C yy d(C yy )) (37) 

= |Cyy| (C-^d (HC XX H T + C nn )) 

= |Cyy| (Cy-y 1 (d(H)C xx H T + HC xx d (H T ))) 

(38) 

= |Cyy| (CxxI^C^^H) + d(H)C xx H T C yy ) 

(39) 

= |Cyy| (C xx Ii T C^d(H) + C xx H T C yy 1 d(H)) 

(40) 

= 2\c yy \(c^n T c^d(n)) 

= 2|C yy |((C yy HC xx ) T d(H)). (41) 

In j39) , we have rotated the first trace (done a cyclic per- 
mutation of the matrix product), and transposed the second 
trace. Because C xx and Cyy are symmetric, they are not 
affected by transposition. Moreover, d(H T ) = (d(H)) T . The 
trace operator is invariant to such rotations and transposition, 
and therefore these operations are justified. In (l40l > we have 
rotated the second term. Such rotations and transpositions will 
be frequently employed throughout. Introducing w = y u y , 
the second differential of (|36l l can be written 

d(w T C yy 1 w)=d((w T C y - y 1 w)) 
= (w T d (C^ 1 ) w) + 2 (w T C" y d (w)) . (42) 

The first term of (l42l is 

(^(Cyyjw) (43) 

= -(w T C yy 1 d(C yy )C yy 1 w) (44) 
= - (w T C yy (rf(H)C xx H T + HC xx d (H T )) C^w) 

= - (CyyWW^yy 1 (^^0,^ + (H T ))) . 

(45) 

Equation d44l > results from Theorem 3, ch. 8, of [ 13 1. Observe 
that C y y ww T C yy in (@5]l is a symmetric matrix, playing the 
same role as C yy in ( f38l >. Therefore, we can utilize d4TI) and 
conclude that 

<w r d(c£)w) = -2((C yy 1 ww T C yy 1 HC xx ) T d(H)). 

(46) 

Recall that w = H(x — u x ) + n u n . The second term of 
(l42l can therefore be written as 

2(w T C yy d(w)) 
= 2(w T C yy 1 d(H) (x-u x )) 
= 2((x-u x )w T C yy 1 d(H)) 

= 2/(c yy 1 w(x-u x ) T ) T d(H)V (47) 



Using d4Tl >.(l4"6l> and (|47T >. and inserting into (l36l l. we find that 

d(|C yy |-' 5 (y)) 

= -g(y) |c yy p' ((c y ;HC xx ) T d(H)) 
+ g(y) iCyyp* ((c y >w T c yy 1 HC xx ) T d(H)) 

- 9(y) iCyyl-'^C^w (x - u x ) T ) T d(H)^ . (48) 
If we define 

R (*,o = c - <k,i) w (k,i) ( X _ U «) T 

+ C y ^ (i - w^w^ T C-^) HC<g, 

where — y— u y k ' l \ and reincorporate the constant factor 

(2tt)~~, we now find that 

d(f (k ' l) (y)) =-/(*-') (y)/(RW) T d(H)V (49) 
Accordingly, ( 1351 ) becomes 

d ^*,i,r,.) _ (^fk,l,r, S ^ R (fe,/) + R(ns)j T d ( H )\ _ ( 50) 

Computing d(z h ' l ' r ' s ) 

* 1 '")=j(u«V;,-') 

(51) 

Apart from a rearrangement of the indexes, equation ( BP ) 
contains two similar terms. Hence it suffices to compute one 
of them. Recalling that u|\y is defined by ( fTTj i, we focus on 
the differential 

(^ )T «?)) 

= (uj^d (u« + cSH r C^)wW))j 

= (ui|y )T CWrf(H^)C y W) W (M)^ (52) 

+ (ujfcSH^d (C yy fe ^) w^A (53) 

+ ^ T C«H r C y ^>d (w(W)j. (54) 
We will resolve this term by term. The first term, d52l . reads 
(^fc^d^C^^ 

= /(c^w^uifcSj'^Hjj . (55) 
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The second term, d53l ), can be written as 



u ( 7 s) C«H T d 

x|y xx 



x|y xi ^yy y yy y yy 



= - (c y W) w (y) u H T C WH T C^-') d (c&'>) 

C (fe,(,r, S ) 

= - ^C( fe >'^ s ) (d (H) CgH T + HCgd (H T 
|c< k x )H T (c( fe ^ s ) + c( fc '^' s ) T ) d(H)) 

C (k,i,r,«) + c (k,i,r,s) T \ HC«) T d(H)\. (56) 



The third term, d54), reads 



= (u^ C«H r C y ^>d(H)(x-u 



C^*.0HCWui^(x-i4«) r ) d(H)\. (57) 



Using (I551 l. (l561 l and d57] i we now define 



U ^yy w u x y b xx 



Due to its two similar terms, the differential in (fSTjl can then 
be written 



d(z k > l ' r > s ) = MD( fc ' l ' r ' 8 )+D( r ' s ' i ' , )) T rf(H)\ . (58) 



Computing d(t) 



d(t) = d 5> fc g ; /( fc ^(y) = Y,P^ d (/ (fc ' } (y) 
y fc,i J k,i 

= -Ew9«/ W) (y)((R (W) ) T ^)). (59) 



fc.i 



The last equation results immediately by employing j49l . 



A. Computing the derivative 

Utilizing d50b , (l58l l and d59l , the complete differential in 
can now be written as 



jk,l,r,s ^.k,l,r,t 



d(4>) = d 



t 2 



fk,Lr,s ( R (fc,i) +R(r,s)) T rf(H; 



fe,Lr,s 



r D (fe,i,r, S ) + D (r, S ,k,9) T d(H) ) / 



t 2 



2f k,Lr,s z k,Lr,s £ p fc ^/(*=,0( y ) / ( R (*,0 ) T d(H) 



f 3 



(60) 



In case of a precoder design problem, one makes the following 
substitutions: H = BF and d(H) = Bd(F) throughout. In 
case of the pilot design problem ( [Tol l, H must be substituted 
by S T ®I TO . In addition, assuming that S is n x r, one makes 
use of the fact that 

vec(dH) = vec (d(S T ) <g> I m ) 

= (I„ ® K mr <E) I m ) (I rn ® vec(I m )) d(vec(S T )). 

Here K mr is the Magnus and Neudecker commutation matrix 
ff3l . Theorem Q] can then be easily applied to d60l ), and iden- 
tifying the derivative in ( f30] > is therefore now straightforward. 

Finally, assume that H in d60b is not a function of some 
other matrix. Complactly defining PkqiPrqsf k ' l ' r ' s = h k ' Lr ' s , 
and observing that J^k i r s h ' ' r ' s = t 2 , we find from equa- 
tions <[30j>, <[60j, and Theorem Q] that 

dG (Hx + n) 



<9H 

Efe l r s hk ' l ' r ' S ( R(fe ' + R (r ' S) ) 2 fc ' i ' r ' S 



E/,fc.Z,r.s 
fe,z>n* ' 

j^k,l,r,s ^"Q(fc,/.r,s) _j_ J)(r,s,fe,Z) ^ 

2 E fc|t ,r, fl S M (y)R ( 

E 4J Pi«7/ (iJ ' ) (y)E fc , I , r ,.fc fc ' , ' r ' i 



(61) 



B. Interchanging the order of derivation and integration 

Recall, that interchanging the order of derivation and inte- 
gration, as in ( |29l , was until now only assumed valid. It derives 
from Lebesgue's Dominated Convergence Theorem that such 
a change is valid if there exists a dominating function v(-) 
satisfying 



dG (Hx + n) 



<9H 



< IKH,x,n) 



(62) 



and 



J J |KH,x,n)|| 2 /(x)/(n)dxdn<oo. (63) 
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Now consider d6Tb and define the function 

ufc,,, r , a (H,x,n) = - (r^ + R< r '»>) z k ^ s 

2z M,^ fc ^ fcg;/ (M)(y) R (M) 

Observe that 

Efc,i,,-, s /l ' C,Z ' r ' Su; fe^,'-^( H ' x ' n ) _ <9G(Hx + n) 



2^,k,l,r.s " 



<9H 



Hence 8G (H^c+n) ^ s a convex combination of the 
Wfc j,r,a(H, x, n)'s, and therefore the function 

u(H,x,n) = ^ ||iy fe>i , r , s (H,x,n)|| 2 

clearly satisfies d62l l. We do not explcitly prove it her, but it 
can be verified that the integral 

]T |K,, r , a (H,x,n)|| a /(x)/(n)dxdn (64) 

k,l,r,s 

is bounded. Hence a dominating function exists, and the 
change of integration and derivation is justified. 

C. First and second order derivatives of the objective function 
When trying to compute 



dG (Hx + n) 
<9H 



/(x)/(n)dxdn, 



(65) 



the mixture densities in the denominators of d6"TT > will not sim- 
plify by substitutions. An entirely similar argument provides 
the reason for why (fT9b cannot be computed analytically in the 
first place J2). Hence, d65l ) cannot be computed analytically, 
and a closed form derivative of ([T9T l w.r.t H does not exist. 
Although not demonstrated here, a similar argument will hold 
also for the second order derivative. 
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